crowdsourced data
A New Lens on Homelessness: Daily Tent Monitoring with 311 Calls and Street Images
Jung, Wooyong, Kim, Sola, Kim, Dongwook, Tabar, Maryam, Lee, Dongwon
Homelessness in the United States has surged to levels unseen since the Great Depression. However, existing methods for monitoring it, such as point-in-time (PIT) counts, have limitations in terms of frequency, consistency, and spatial detail. This study proposes a new approach using publicly available, crowdsourced data, specifically 311 Service Calls and street-level imagery, to track and forecast homeless tent trends in San Francisco. Our predictive model captures fine-grained daily and neighborhood-level variations, uncovering patterns that traditional counts often overlook, such as rapid fluctuations during the COVID-19 pandemic and spatial shifts in tent locations over time. By providing more timely, localized, and cost-effective information, this approach serves as a valuable tool for guiding policy responses and evaluating interventions aimed at reducing unsheltered homelessness.
- North America > United States > California > San Francisco County > San Francisco (0.27)
- North America > United States > Texas (0.04)
- North America > United States > Pennsylvania (0.04)
- (2 more...)
Machine Learning for Modeling Wireless Radio Metrics with Crowdsourced Data and Local Environment Features
This paper presents a suite of machine learning models, CRC-ML-Radio Metrics, designed for modeling RSRP, RSRQ, and RSSI wireless radio metrics in 4G environments. These models utilize crowdsourced data with local environmental features to enhance prediction accuracy across both indoor at elevation and outdoor urban settings. They achieve RMSE performance of 9.76 to 11.69 dB for RSRP, 2.90 to 3.23 dB for RSRQ, and 9.50 to 10.36 dB for RSSI, evaluated on over 300,000 data points in the Toronto, Montreal, and Vancouver areas. These results demonstrate the robustness and adaptability of the models, supporting precise network planning and quality of service optimization in complex Canadian urban environments.
- North America > Canada > Quebec > Montreal (0.26)
- North America > Canada > Ontario > Toronto (0.26)
- North America > Canada > Ontario > National Capital Region > Ottawa (0.04)
- Europe > Denmark (0.04)
RoboCrowd: Scaling Robot Data Collection through Crowdsourcing
Mirchandani, Suvir, Yuan, David D., Burns, Kaylee, Islam, Md Sazzad, Zhao, Tony Z., Finn, Chelsea, Sadigh, Dorsa
In recent years, imitation learning from large-scale human demonstrations has emerged as a promising paradigm for training robot policies. However, the burden of collecting large quantities of human demonstrations is significant in terms of collection time and the need for access to expert operators. We introduce a new data collection paradigm, RoboCrowd, which distributes the workload by utilizing crowdsourcing principles and incentive design. RoboCrowd helps enable scalable data collection and facilitates more efficient learning of robot policies. We build RoboCrowd on top of ALOHA (Zhao et al. 2023) -- a bimanual platform that supports data collection via puppeteering -- to explore the design space for crowdsourcing in-person demonstrations in a public environment. We propose three classes of incentive mechanisms to appeal to users' varying sources of motivation for interacting with the system: material rewards, intrinsic interest, and social comparison. We instantiate these incentives through tasks that include physical rewards, engaging or challenging manipulations, as well as gamification elements such as a leaderboard. We conduct a large-scale, two-week field experiment in which the platform is situated in a university cafe. We observe significant engagement with the system -- over 200 individuals independently volunteered to provide a total of over 800 interaction episodes. Our findings validate the proposed incentives as mechanisms for shaping users' data quantity and quality. Further, we demonstrate that the crowdsourced data can serve as useful pre-training data for policies fine-tuned on expert demonstrations -- boosting performance up to 20% compared to when this data is not available. These results suggest the potential for RoboCrowd to reduce the burden of robot data collection by carefully implementing crowdsourcing and incentive design principles.
- Research Report > New Finding (1.00)
- Instructional Material > Course Syllabus & Notes (0.93)
- Leisure & Entertainment > Games > Computer Games (0.48)
- Information Technology > Security & Privacy (0.34)
Mixture of Experts based Multi-task Supervise Learning from Crowds
Han, Tao, Shi, Huaixuan, Ding, Xinyi, Ma, Xiao, Gu, Huamao, Fang, Yili
Existing truth inference methods in crowdsourcing aim to map redundant labels and items to the ground truth. They treat the ground truth as hidden variables and use statistical or deep learning-based worker behavior models to infer the ground truth. However, worker behavior models that rely on ground truth hidden variables overlook workers' behavior at the item feature level, leading to imprecise characterizations and negatively impacting the quality of truth inference. This paper proposes a new paradigm of multi-task supervised learning from crowds, which eliminates the need for modeling of items's ground truth in worker behavior models. Within this paradigm, we propose a worker behavior model at the item feature level called Mixture of Experts based Multi-task Supervised Learning from Crowds (MMLC). Two truth inference strategies are proposed within MMLC. The first strategy, named MMLC-owf, utilizes clustering methods in the worker spectral space to identify the projection vector of the oracle worker. Subsequently, the labels generated based on this vector are considered as the inferred truth. The second strategy, called MMLC-df, employs the MMLC model to fill the crowdsourced data, which can enhance the effectiveness of existing truth inference methods. Experimental results demonstrate that MMLC-owf outperforms state-of-the-art methods and MMLC-df enhances the quality of existing truth inference methods.
Crowdsourcing with Enhanced Data Quality Assurance: An Efficient Approach to Mitigate Resource Scarcity Challenges in Training Large Language Models for Healthcare
Barai, P., Leroy, G., Bisht, P., Rothman, J. M., Lee, S., Andrews, J., Rice, S. A., Ahmed, A.
Large Language Models (LLMs) have demonstrated immense potential in artificial intelligence across various domains, including healthcare. However, their efficacy is hindered by the need for high-quality labeled data, which is often expensive and time-consuming to create, particularly in low-resource domains like healthcare. To address these challenges, we propose a crowdsourcing (CS) framework enriched with quality control measures at the pre-, real-time-, and post-data gathering stages. Our study evaluated the effectiveness of enhancing data quality through its impact on LLMs (Bio-BERT) for predicting autism-related symptoms. The results show that real-time quality control improves data quality by 19 percent compared to pre-quality control. Fine-tuning Bio-BERT using crowdsourced data generally increased recall compared to the Bio-BERT baseline but lowered precision. Our findings highlighted the potential of crowdsourcing and quality control in resource-constrained environments and offered insights into optimizing healthcare LLMs for informed decision-making and improved patient care.
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > United States > Arizona > Pima County > Tucson (0.04)
- North America > United States > Alaska (0.04)
Data Quality in Crowdsourcing and Spamming Behavior Detection
Ba, Yang, Mancenido, Michelle V., Chiou, Erin K., Pan, Rong
As crowdsourcing emerges as an efficient and cost-effective method for obtaining labels for machine learning datasets, it is important to assess the quality of crowd-provided data, so as to improve analysis performance and reduce biases in subsequent machine learning tasks. Given the lack of ground truth in most cases of crowdsourcing, we refer to data quality as annotators' consistency and credibility. Unlike the simple scenarios where Kappa coefficient and intraclass correlation coefficient usually can apply, online crowdsourcing requires dealing with more complex situations. We introduce a systematic method for evaluating data quality and detecting spamming threats via variance decomposition, and we classify spammers into three categories based on their different behavioral patterns. A spammer index is proposed to assess entire data consistency and two metrics are developed to measure crowd worker's credibility by utilizing the Markov chain and generalized random effects models. Furthermore, we showcase the practicality of our techniques and their advantages by applying them on a face verification task with both simulation and real-world data collected from two crowdsourcing platforms.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > Texas > Travis County > Austin (0.04)
- (5 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.92)
- Information Technology (0.67)
- Health & Medicine (0.67)
- Government > Regional Government > North America Government > United States Government (0.46)
Modeling Large-Scale Walking and Cycling Networks: A Machine Learning Approach Using Mobile Phone and Crowdsourced Data
Saberi, Meead, Lilasathapornkit, Tanapon
Walking and cycling are known to bring substantial health, environmental, and economic advantages. However, the development of evidence-based active transportation planning and policies has been impeded by significant data limitations, such as biases in crowdsourced data and representativeness issues of mobile phone data. In this study, we develop and apply a machine learning based modeling approach for estimating daily walking and cycling volumes across a large-scale regional network in New South Wales, Australia that includes 188,999 walking links and 114,885 cycling links. The modeling methodology leverages crowdsourced and mobile phone data as well as a range of other datasets on population, land use, topography, climate, etc. The study discusses the unique challenges and limitations related to all three aspects of model training, testing, and inference given the large geographical extent of the modeled networks and relative scarcity of observed walking and cycling count data. The study also proposes a new technique to identify model estimate outliers and to mitigate their impact. Overall, the study provides a valuable resource for transportation modelers, policymakers and urban planners seeking to enhance active transportation infrastructure planning and policies with advanced emerging data-driven modeling methodologies.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Oceania > Australia > New South Wales > Sydney (0.04)
- (5 more...)
- Research Report > New Finding (0.66)
- Research Report > Experimental Study (0.48)
- Health & Medicine (1.00)
- Telecommunications (0.93)
- Government > Regional Government > Oceania Government > Australia Government (0.69)
- Transportation > Infrastructure & Services (0.68)
Near-real-time Earthquake-induced Fatality Estimation using Crowdsourced Data and Large-Language Models
Wang, Chenguang, Engler, Davis, Li, Xuechun, Hou, James, Wald, David J., Jaiswal, Kishor, Xu, Susu
When a damaging earthquake occurs, immediate information about casualties is critical for time-sensitive decision-making by emergency response and aid agencies in the first hours and days. Systems such as Prompt Assessment of Global Earthquakes for Response (PAGER) by the U.S. Geological Survey (USGS) were developed to provide a forecast within about 30 minutes of any significant earthquake globally. Traditional systems for estimating human loss in disasters often depend on manually collected early casualty reports from global media, a process that's labor-intensive and slow with notable time delays. Recently, some systems have employed keyword matching and topic modeling to extract relevant information from social media. However, these methods struggle with the complex semantics in multilingual texts and the challenge of interpreting ever-changing, often conflicting reports of death and injury numbers from various unverified sources on social media platforms. In this work, we introduce an end-to-end framework to significantly improve the timeliness and accuracy of global earthquake-induced human loss forecasting using multi-lingual, crowdsourced social media. Our framework integrates (1) a hierarchical casualty extraction model built upon large language models, prompt design, and few-shot learning to retrieve quantitative human loss claims from social media, (2) a physical constraint-aware, dynamic-truth discovery model that discovers the truthful human loss from massive noisy and potentially conflicting human loss claims, and (3) a Bayesian updating loss projection model that dynamically updates the final loss estimation using discovered truths. We test the framework in real-time on a series of global earthquake events in 2021 and 2022 and show that our framework streamlines casualty data retrieval, achieving speed and accuracy comparable to manual methods by USGS.
- North America > Haiti (0.50)
- Asia > China > Sichuan Province (0.14)
- Asia > Philippines (0.05)
- (12 more...)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)